This project addresses inequality of educational opportunity in U.S. high schools. Here we will focus on average student performance on the ACT or SAT exams that students take as part of the college application process.
We expect a range of school performance on these exams, but is school performance predicted by socioeconomic factors?
Public schools in the United States are designed with the goal of giving every student an equal opportunity to learn and succeed. However, this ideal can be difficult to uphold given the large disparities between schools across the US. There are hundreds, if not thousands, of factors that can differ wildly from school to school, and providing an equal education across all of these differences is difficult and often not achieved. The result is a problem of educational inequality in the US, where students may be disadvantaged by factors they have no control over. To combat this issue, it is important to study which factors are correlated with educational inequality, in the hope of pointing future research toward the root causes of some of this inequality.
We will examine how a number of socioeconomic factors correlate with unequal education in the US. In particular, we will look at the percentage of students receiving free and reduced lunch at a school, along with the median income, percentage of married individuals, unemployment rate, and percentage of college graduates in the school's geographic area, to identify and quantify correlations between these variables and the quality of education students receive. Since there are very few reliable metrics for comparing academic performance at a national level, we will compare average ACT scores of different public high schools and use census data, along with data each school provides, to determine the socioeconomic factors at play. We will use these data to address questions such as: Do these socioeconomic factors correlate with performance on the ACT? If so, how? Which factors correlate most strongly, and how certain are we of those correlations? Can we predict ACT scores from one or more of these metrics? We will also look at how these factors and test scores vary between two states (Washington and New York) and compare those results to what we see at the national level. Together, these questions should help us understand some of the ways socioeconomic factors may correlate with inequality in the US public school system. The answers should suggest avenues for further research and, hopefully, inform policy that addresses this educational inequality.
This project utilizes two data sets. The primary data set is the EdGap data set from EdGap.org. This data set from 2016 includes information about average ACT or SAT scores for schools and several socioeconomic characteristics of the school district. The secondary data set is basic information about each school from the National Center for Education Statistics.
All socioeconomic data (household income, unemployment, adult educational attainment, and family structure) are from the Census Bureau’s American Community Survey.
EdGap.org reports that the ACT and SAT score data come from each state's department of education or some other public data release. The nature of those other public data releases is not known.
The quality of the census data and the department of education data can be assumed to be reasonably high.
EdGap.org does not indicate that they processed the data in any way. The data were assembled by the EdGap.org team, so there is always the possibility of human error. Given the public nature of the data, we would be able to consult the original data sources to check the quality of the data if we had any questions.
The school information data is from the National Center for Education Statistics. This data set consists of basic identifying information about schools and can be assumed to be of reasonably high quality. As for the EdGap.org data, the school information data is public, so we would be able to consult the original data sources to check the quality of the data if we had any questions.
#readxl lets us read Excel files
library(readxl)
#GGally has a nice pairs plotting function
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
#skimr provides a nice summary of a data set
library(skimr)
#leaps will be used for model selection
library(leaps)
#kableExtra will be used to make tables in the html document
library(kableExtra)
#latex2exp lets us use LaTex in ggplot
library(latex2exp)
#tidyverse contains packages we will use for processing and plotting data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag() masks stats::lag()
\(\rightarrow\) Load the data set contained in the file EdGap_data.xlsx and name the data frame edgap.
edgap <- read_excel("EdGap_data.xlsx")
\(\rightarrow\) Look at the first few rows of the data frame.
head(edgap)
## # A tibble: 6 × 7
## `NCESSCH School ID` `CT Unemployment Rate` `CT Pct Adults w… `CT Pct Childre …
## <dbl> <dbl> <dbl> <dbl>
## 1 100001600143 0.118 0.445 0.346
## 2 100008000024 0.0640 0.663 0.768
## 3 100008000225 0.0565 0.702 0.713
## 4 100017000029 0.0447 0.692 0.641
## 5 100018000040 0.0770 0.640 0.834
## 6 100019000050 0.0801 0.673 0.483
## # … with 3 more variables: CT Median Household Income <dbl>,
## # School ACT average (or equivalent if SAT score) <dbl>,
## # School Pct Free and Reduced Lunch <dbl>
Use the View function to explore the full data set.
View(edgap)
\(\rightarrow\) Load the data set contained in the file ccd_sch_029_1617_w_1a_11212017.csv and name the data frame school_info.
school_info <- read_csv("ccd_sch_029_1617_w_1a_11212017.csv")
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 102181 Columns: 65
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (60): SCHOOL_YEAR, FIPST, STATENAME, ST, SCH_NAME, LEA_NAME, STATE_AGENC...
## dbl (3): SY_STATUS, UPDATED_STATUS, SCH_TYPE
## lgl (2): MSTREET3, LSTREET3
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
\(\rightarrow\) Look at the first few rows of the data frame.
head(school_info)
## # A tibble: 6 × 65
## SCHOOL_YEAR FIPST STATENAME ST SCH_NAME LEA_NAME STATE_AGENCY_NO UNION
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2016-2017 01 ALABAMA AL Sequoyah Sc… Alabama … 01 <NA>
## 2 2016-2017 01 ALABAMA AL Camps Alabama … 01 <NA>
## 3 2016-2017 01 ALABAMA AL Det Ctr Alabama … 01 <NA>
## 4 2016-2017 01 ALABAMA AL Wallace Sch… Alabama … 01 <NA>
## 5 2016-2017 01 ALABAMA AL McNeel Sch … Alabama … 01 <NA>
## 6 2016-2017 01 ALABAMA AL Alabama You… Alabama … 01 <NA>
## # … with 57 more variables: ST_LEAID <chr>, LEAID <chr>, ST_SCHID <chr>,
## # NCESSCH <chr>, SCHID <chr>, MSTREET1 <chr>, MSTREET2 <chr>, MSTREET3 <lgl>,
## # MCITY <chr>, MSTATE <chr>, MZIP <chr>, MZIP4 <chr>, LSTREET1 <chr>,
## # LSTREET2 <chr>, LSTREET3 <lgl>, LCITY <chr>, LSTATE <chr>, LZIP <chr>,
## # LZIP4 <chr>, PHONE <chr>, WEBSITE <chr>, SY_STATUS <dbl>,
## # SY_STATUS_TEXT <chr>, UPDATED_STATUS <dbl>, UPDATED_STATUS_TEXT <chr>,
## # EFFECTIVE_DATE <chr>, SCH_TYPE_TEXT <chr>, SCH_TYPE <dbl>, …
In a tidy data set, each column is a variable or id and each row is an observation. We again start by looking at the head of the data frame to determine if the data frame is tidy.
head(edgap)
## # A tibble: 6 × 7
## `NCESSCH School ID` `CT Unemployment Rate` `CT Pct Adults w… `CT Pct Childre …
## <dbl> <dbl> <dbl> <dbl>
## 1 100001600143 0.118 0.445 0.346
## 2 100008000024 0.0640 0.663 0.768
## 3 100008000225 0.0565 0.702 0.713
## 4 100017000029 0.0447 0.692 0.641
## 5 100018000040 0.0770 0.640 0.834
## 6 100019000050 0.0801 0.673 0.483
## # … with 3 more variables: CT Median Household Income <dbl>,
## # School ACT average (or equivalent if SAT score) <dbl>,
## # School Pct Free and Reduced Lunch <dbl>
The units of observation are the schools. Each school occupies one row of the data frame, which is what we want for the rows.
\(\rightarrow\) What variables are present? What types of variables are present?
skim(edgap)
| Name | edgap |
| Number of rows | 7986 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| NCESSCH School ID | 0 | 1 | 3.321869e+11 | 1.323638e+11 | 1.000016e+11 | 2.105340e+11 | 3.600085e+11 | 4.226678e+11 | 5.60583e+11 | ▇▆▆▅▇ |
| CT Unemployment Rate | 14 | 1 | 1.000000e-01 | 6.000000e-02 | 0.000000e+00 | 6.000000e-02 | 9.000000e-02 | 1.200000e-01 | 5.90000e-01 | ▇▃▁▁▁ |
| CT Pct Adults with College Degree | 13 | 1 | 5.700000e-01 | 1.700000e-01 | 9.000000e-02 | 4.500000e-01 | 5.500000e-01 | 6.800000e-01 | 1.00000e+00 | ▁▅▇▅▂ |
| CT Pct Childre In Married Couple Family | 25 | 1 | 6.300000e-01 | 2.000000e-01 | 0.000000e+00 | 5.200000e-01 | 6.700000e-01 | 7.800000e-01 | 1.00000e+00 | ▁▂▅▇▃ |
| CT Median Household Income | 20 | 1 | 5.202691e+04 | 2.422806e+04 | 3.589000e+03 | 3.659725e+04 | 4.683350e+04 | 6.136925e+04 | 2.26181e+05 | ▇▆▁▁▁ |
| School ACT average (or equivalent if SAT score) | 0 | 1 | 2.018000e+01 | 2.600000e+00 | -3.070000e+00 | 1.860000e+01 | 2.040000e+01 | 2.191000e+01 | 3.23600e+01 | ▁▁▂▇▁ |
| School Pct Free and Reduced Lunch | 0 | 1 | 4.200000e-01 | 2.400000e-01 | -5.000000e-02 | 2.400000e-01 | 3.800000e-01 | 5.800000e-01 | 1.00000e+00 | ▃▇▆▃▂ |
\(\rightarrow\) Is the edgap data frame tidy?
Yes, the data frame is tidy since each variable has its own column, each observation has its own row, and each value has its own cell.
We again start by looking at the head of the data frame to determine if the data frame is tidy.
head(school_info)
## # A tibble: 6 × 65
## SCHOOL_YEAR FIPST STATENAME ST SCH_NAME LEA_NAME STATE_AGENCY_NO UNION
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2016-2017 01 ALABAMA AL Sequoyah Sc… Alabama … 01 <NA>
## 2 2016-2017 01 ALABAMA AL Camps Alabama … 01 <NA>
## 3 2016-2017 01 ALABAMA AL Det Ctr Alabama … 01 <NA>
## 4 2016-2017 01 ALABAMA AL Wallace Sch… Alabama … 01 <NA>
## 5 2016-2017 01 ALABAMA AL McNeel Sch … Alabama … 01 <NA>
## 6 2016-2017 01 ALABAMA AL Alabama You… Alabama … 01 <NA>
## # … with 57 more variables: ST_LEAID <chr>, LEAID <chr>, ST_SCHID <chr>,
## # NCESSCH <chr>, SCHID <chr>, MSTREET1 <chr>, MSTREET2 <chr>, MSTREET3 <lgl>,
## # MCITY <chr>, MSTATE <chr>, MZIP <chr>, MZIP4 <chr>, LSTREET1 <chr>,
## # LSTREET2 <chr>, LSTREET3 <lgl>, LCITY <chr>, LSTATE <chr>, LZIP <chr>,
## # LZIP4 <chr>, PHONE <chr>, WEBSITE <chr>, SY_STATUS <dbl>,
## # SY_STATUS_TEXT <chr>, UPDATED_STATUS <dbl>, UPDATED_STATUS_TEXT <chr>,
## # EFFECTIVE_DATE <chr>, SCH_TYPE_TEXT <chr>, SCH_TYPE <dbl>, …
It is more difficult to assess whether this data frame is tidy because of the large number of columns with confusing names.
We have 65 columns, but many of the columns are identifying features of the school and are irrelevant for the analysis. We will select a subset of columns and do the analysis on the subset.
We want to select the following variables:
| name | description |
|---|---|
| NCESSCH | NCES School ID |
| MSTATE | State name |
| MZIP | Zip code |
| SCH_TYPE_TEXT | School type |
| LEVEL | School level |
\(\rightarrow\) Use the select function to select these columns from the data frame.
school_info <- school_info %>%
select("NCESSCH", "MSTATE", "MZIP", "SCH_TYPE_TEXT", "LEVEL")
\(\rightarrow\) Examine the head of the new, smaller data frame
head(school_info)
## # A tibble: 6 × 5
## NCESSCH MSTATE MZIP SCH_TYPE_TEXT LEVEL
## <chr> <chr> <chr> <chr> <chr>
## 1 010000200277 AL 35220 Alternative School High
## 2 010000201667 AL 36057 Alternative School High
## 3 010000201670 AL 36057 Alternative School High
## 4 010000201705 AL 36057 Alternative School High
## 5 010000201706 AL 35206 Alternative School High
## 6 010000201876 AL 35080 Regular School Not applicable
\(\rightarrow\) Is the data frame tidy?
Yes: each variable has its own column, each observation has its own row, and each value has its own cell.
\(\rightarrow\) How many observations are in the EdGap data set?
nrow(edgap)
## [1] 7986
There are 7986 rows, so there are 7986 observations in the data set.
\(\rightarrow\) How many observations are in the school information data set?
nrow(school_info)
## [1] 102181
There are 102,181 rows, so there are 102,181 observations in the data set.
In any analysis, we might rename the variables in the data frame to make them easier to work with. We have seen that the variable names in the edgap data frame allow us to understand them, but they can be improved. In contrast, many of the variables in the school_info data frame have confusing names.
Importantly, we should be thinking ahead to joining the two data frames based on the school ID. To facilitate this join, we should give the school ID column the same name in each data frame.
\(\rightarrow\) View the column names for the EdGap data
names(edgap)
## [1] "NCESSCH School ID"
## [2] "CT Unemployment Rate"
## [3] "CT Pct Adults with College Degree"
## [4] "CT Pct Childre In Married Couple Family"
## [5] "CT Median Household Income"
## [6] "School ACT average (or equivalent if SAT score)"
## [7] "School Pct Free and Reduced Lunch"
To follow best practices, we should:

- Use _ between words in a variable name. Do not use blank spaces, as they have done.
- Avoid abbreviations, such as Pct, that can easily be written out.
- Place words like rate, percent, median, and average consistently at either the beginning or the end of the name.

We will use the rename function from the dplyr package to rename the columns.
#The new name for the column is on the left of the =
edgap <- edgap %>%
rename(id = "NCESSCH School ID",
rate_unemployment = "CT Unemployment Rate",
percent_college = "CT Pct Adults with College Degree",
percent_married = "CT Pct Childre In Married Couple Family",
median_income = "CT Median Household Income",
average_act = "School ACT average (or equivalent if SAT score)",
percent_lunch = "School Pct Free and Reduced Lunch"
)
\(\rightarrow\) Use the names function to see that the names have changed
names(edgap)
## [1] "id" "rate_unemployment" "percent_college"
## [4] "percent_married" "median_income" "average_act"
## [7] "percent_lunch"
\(\rightarrow\) View the column names for the school information data
names(school_info)
## [1] "NCESSCH" "MSTATE" "MZIP" "SCH_TYPE_TEXT"
## [5] "LEVEL"
The names can be improved for readability. We also have the constraint that we must rename NCESSCH to id to be consistent with the edgap data.
Rename the columns of the school information data frame.
school_info <- school_info %>%
rename(id = "NCESSCH",
state = "MSTATE",
zip_code = "MZIP",
school_type = "SCH_TYPE_TEXT",
school_level = "LEVEL"
)
#Print the names to see the change
names(school_info)
## [1] "id" "state" "zip_code" "school_type" "school_level"
We will join the edgap and school_info data frames based on the school ID. We should first note that the id is coded differently in the two data frames:
typeof(edgap$id)
## [1] "double"
typeof(school_info$id)
## [1] "character"
Although id is stored as a number, it is a categorical identifier and should be represented as a character variable in R.
Convert id in edgap to a character variable:
We will use the mutate function from the dplyr package to convert the column type.
edgap <- edgap %>%
mutate(id = as.character(id))
#Check that the type has been converted to character
typeof(edgap$id)
## [1] "character"
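One subtlety worth checking before the join (a sketch with a hypothetical, NCES-style ID, not taken from the data): a round trip through numeric silently drops any leading zeros, so a character key rebuilt with as.character may no longer match the original zero-padded ID.

```r
# Hypothetical zero-padded school ID
original <- "010000200277"

# Reading the ID as a number drops the leading zero...
as_number <- as.numeric(original)

# ...so converting back to character no longer matches the original key
recovered <- as.character(as_number)
identical(recovered, original)  # FALSE: "10000200277" vs "010000200277"

# If needed, re-pad to the full 12 digits before joining
sprintf("%012.0f", as_number)
```

If a join unexpectedly produces many NA rows, this kind of key mismatch is a good first thing to rule out.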
We will now join the data frames. We want to perform a left join based on the school ID id so that we incorporate all of the school information into the edgap data frame.
edgap <- edgap %>%
left_join(school_info, by = "id")
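As a reminder of what a left join does (a toy sketch with made-up rows, using base R's merge; dplyr's left_join behaves the same way here): every row of the left table is kept, and ids without a match receive NA in the incoming columns.

```r
# Toy data: school "c" has no match in the info table
scores <- data.frame(id = c("a", "b", "c"), average_act = c(20.1, 21.5, 18.9))
info   <- data.frame(id = c("a", "b"),      state = c("WA", "NY"))

# all.x = TRUE keeps every row of scores, i.e. a left join
joined <- merge(scores, info, by = "id", all.x = TRUE)
joined$state  # "WA" "NY" NA: the unmatched school keeps its row, with NA state
```

This is why unmatched schools show up below as NA in state, zip_code, school_type, and school_level rather than disappearing from the data frame.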
Examine the head of the new data frame:
head(edgap)
## # A tibble: 6 × 11
## id rate_unemployment percent_college percent_married median_income
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 100001600143 0.118 0.445 0.346 42820
## 2 100008000024 0.0640 0.663 0.768 89320
## 3 100008000225 0.0565 0.702 0.713 84140
## 4 100017000029 0.0447 0.692 0.641 56500
## 5 100018000040 0.0770 0.640 0.834 54015
## 6 100019000050 0.0801 0.673 0.483 50649
## # … with 6 more variables: average_act <dbl>, percent_lunch <dbl>, state <chr>,
## # zip_code <chr>, school_type <chr>, school_level <chr>
\(\rightarrow\) How many missing values are there in each column? Give the number of missing values and the percent of values in each column that are missing.
Recall that missing values are coded in R with NA, or they may be empty. We want to convert empty cells to NA, so that we can use the function is.na to find all missing values. The function read_excel that we used to read in the data does this automatically, so we do not need to take further action to deal with empty cells.
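For input paths that do not map empty cells to NA automatically, the recoding is a one-liner; a minimal base-R sketch with hypothetical values:

```r
# A column read as text, with an empty string standing in for a missing value
x <- c("0.118", "", "0.064")

# Recode empty strings as NA so that is.na() can find them
x[x == ""] <- NA
as.numeric(x)
```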
print("Total NAs:")
## [1] "Total NAs:"
colSums(is.na(edgap))
## id rate_unemployment percent_college percent_married
## 0 14 13 25
## median_income average_act percent_lunch state
## 20 0 0 88
## zip_code school_type school_level
## 88 88 88
print("Percent NAs:")
## [1] "Percent NAs:"
colSums(is.na(edgap))/nrow(edgap)*100
## id rate_unemployment percent_college percent_married
## 0.0000000 0.1753068 0.1627849 0.3130478
## median_income average_act percent_lunch state
## 0.2504383 0.0000000 0.0000000 1.1019284
## zip_code school_type school_level
## 1.1019284 1.1019284 1.1019284
\(\rightarrow\) Find the rows where the missing value occurs in each column.
apply(is.na(edgap),2,which)
## $id
## integer(0)
##
## $rate_unemployment
## [1] 142 143 205 364 1056 1074 2067 2176 2717 2758 2931 3973 4150 6103
##
## $percent_college
## [1] 142 205 364 1056 1074 2067 2176 2717 2758 2931 3973 4150 6103
##
## $percent_married
## [1] 7 142 143 205 364 465 1056 1074 2067 2176 2313 2419 2593 2717 2758
## [16] 2774 2931 3905 3973 4150 4772 5879 6103 6626 7248
##
## $median_income
## [1] 142 143 205 364 465 1056 1074 2067 2176 2419 2593 2717 2758 2931 3973
## [16] 4150 5879 6103 6626 7248
##
## $average_act
## integer(0)
##
## $percent_lunch
## integer(0)
##
## $state
## [1] 482 483 485 486 487 488 490 507 1088 1428 1444 1458 1468 1575 1649
## [16] 2028 2029 2042 2046 2047 2105 2107 2575 2580 2589 2592 2617 2635 2688 2767
## [31] 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785
## [46] 2786 2787 2788 2932 3001 3127 3487 3544 3830 4397 4660 4662 4665 4737 4766
## [61] 4776 5096 5421 5453 5454 5496 5568 5764 6006 6007 6025 6027 6029 6032 6034
## [76] 6043 6044 6350 6354 6356 6357 6359 6379 6834 6901 7564 7568 7961
##
## $zip_code
## [1] 482 483 485 486 487 488 490 507 1088 1428 1444 1458 1468 1575 1649
## [16] 2028 2029 2042 2046 2047 2105 2107 2575 2580 2589 2592 2617 2635 2688 2767
## [31] 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785
## [46] 2786 2787 2788 2932 3001 3127 3487 3544 3830 4397 4660 4662 4665 4737 4766
## [61] 4776 5096 5421 5453 5454 5496 5568 5764 6006 6007 6025 6027 6029 6032 6034
## [76] 6043 6044 6350 6354 6356 6357 6359 6379 6834 6901 7564 7568 7961
##
## $school_type
## [1] 482 483 485 486 487 488 490 507 1088 1428 1444 1458 1468 1575 1649
## [16] 2028 2029 2042 2046 2047 2105 2107 2575 2580 2589 2592 2617 2635 2688 2767
## [31] 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785
## [46] 2786 2787 2788 2932 3001 3127 3487 3544 3830 4397 4660 4662 4665 4737 4766
## [61] 4776 5096 5421 5453 5454 5496 5568 5764 6006 6007 6025 6027 6029 6032 6034
## [76] 6043 6044 6350 6354 6356 6357 6359 6379 6834 6901 7564 7568 7961
##
## $school_level
## [1] 482 483 485 486 487 488 490 507 1088 1428 1444 1458 1468 1575 1649
## [16] 2028 2029 2042 2046 2047 2105 2107 2575 2580 2589 2592 2617 2635 2688 2767
## [31] 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785
## [46] 2786 2787 2788 2932 3001 3127 3487 3544 3830 4397 4660 4662 4665 4737 4766
## [61] 4776 5096 5421 5453 5454 5496 5568 5764 6006 6007 6025 6027 6029 6032 6034
## [76] 6043 6044 6350 6354 6356 6357 6359 6379 6834 6901 7564 7568 7961
Now we need to decide how to deal with the NAs. We have a few decisions to make:
- Do we drop the rows containing NAs?
- Do we replace the NAs with something like the mean of the variable?
- Do we replace the NAs with estimates based on the other variable values?

There are some schools that are missing all four of the socioeconomic variables, e.g. at rows 142 and 205. However, many of the schools are missing only a subset of the variables. If we drop rows that have NAs, then we will negatively affect our analysis using the variables where data were present. So, we will not drop the rows in this data set that are missing the socioeconomic variables. We have so few missing values in each variable that we will not worry about replacing NAs with some other value. We will instead selectively omit the NAs when working with those columns.
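The options can be seen side by side on a toy vector (hypothetical values, not the project data):

```r
x <- c(0.06, 0.09, NA, 0.12)

# Option 1: drop the observations with NA
x[!is.na(x)]

# Option 2: replace each NA with the variable mean
ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# Our approach: keep the NA and omit it only where a calculation requires it
mean(x, na.rm = TRUE)  # 0.09
```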
There are, however, 88 schools where we do not have the school information. This raises the possibility that the information is unreliable. Because we are not able to check from the source, we will omit these rows from the data set.
\(\rightarrow\) Use the filter function to drop only those rows where the state information is missing.
edgap <- edgap %>% filter(!is.na(state))
Recheck for missing values:
skim(edgap)
| Name | edgap |
| Number of rows | 7898 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 12 | 12 | 0 | 7898 | 0 |
| state | 0 | 1 | 2 | 2 | 0 | 20 | 0 |
| zip_code | 0 | 1 | 5 | 5 | 0 | 6519 | 0 |
| school_type | 0 | 1 | 14 | 27 | 0 | 4 | 0 |
| school_level | 0 | 1 | 4 | 12 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| rate_unemployment | 14 | 1 | 0.10 | 0.06 | 0.00 | 0.06 | 0.09 | 0.12 | 0.59 | ▇▃▁▁▁ |
| percent_college | 13 | 1 | 0.57 | 0.17 | 0.09 | 0.45 | 0.56 | 0.68 | 1.00 | ▁▅▇▅▂ |
| percent_married | 24 | 1 | 0.64 | 0.20 | 0.00 | 0.53 | 0.67 | 0.78 | 1.00 | ▁▂▅▇▃ |
| median_income | 20 | 1 | 52167.27 | 24173.80 | 3589.00 | 36786.75 | 47000.00 | 61516.75 | 226181.00 | ▇▆▁▁▁ |
| average_act | 0 | 1 | 20.21 | 2.57 | -3.07 | 18.64 | 20.40 | 21.94 | 32.36 | ▁▁▂▇▁ |
| percent_lunch | 0 | 1 | 0.42 | 0.24 | -0.05 | 0.24 | 0.38 | 0.57 | 1.00 | ▃▇▆▃▂ |
We will do some quality control for the data set.
We can check the range of each variable to see that the values fall in the expected ranges.
summary(edgap)
## id rate_unemployment percent_college percent_married
## Length:7898 Min. :0.00000 Min. :0.09149 Min. :0.0000
## Class :character 1st Qu.:0.05854 1st Qu.:0.45120 1st Qu.:0.5261
## Mode :character Median :0.08523 Median :0.55572 Median :0.6686
## Mean :0.09806 Mean :0.56938 Mean :0.6354
## 3rd Qu.:0.12272 3rd Qu.:0.67719 3rd Qu.:0.7778
## Max. :0.59028 Max. :1.00000 Max. :1.0000
## NA's :14 NA's :13 NA's :24
## median_income average_act percent_lunch state
## Min. : 3589 Min. :-3.071 Min. :-0.05455 Length:7898
## 1st Qu.: 36787 1st Qu.:18.639 1st Qu.: 0.23772 Class :character
## Median : 47000 Median :20.400 Median : 0.37904 Mode :character
## Mean : 52167 Mean :20.211 Mean : 0.41824
## 3rd Qu.: 61517 3rd Qu.:21.935 3rd Qu.: 0.57060
## Max. :226181 Max. :32.363 Max. : 0.99873
## NA's :20
## zip_code school_type school_level
## Length:7898 Length:7898 Length:7898
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
There are a few suspicious values. The minimum average_act is -3.071, but ACT scores must be non-negative. Similarly, the minimum percent_lunch is -0.05455, but a percent must be non-negative. These are out-of-range values. We do not have access to information about how these particular data points were generated, so we will remove them from the data set by converting them to NA.
#Number of NA ACT scores before conversion
sum(is.na(edgap$average_act))
## [1] 0
#Convert negative scores to NA
edgap[edgap$average_act < 0,"average_act"] = NA
#Number of NA ACT scores after conversion
sum(is.na(edgap$average_act))
## [1] 3
There were 3 schools with negative ACT scores; those out-of-range values were converted to NA.
#Number of NA percent_lunch values before conversion
sum(is.na(edgap$percent_lunch))
## [1] 0
#Convert negative values to NA
edgap[edgap$percent_lunch < 0,"percent_lunch"] = NA
#Number of NA percent_lunch values after conversion
sum(is.na(edgap$percent_lunch))
## [1] 20
There were 20 schools with a negative percent free or reduced lunch; those out-of-range values were converted to NA.
edgap %>%
count(school_level) %>%
mutate(proportion = round(n/sum(n),3))
## # A tibble: 4 × 3
## school_level n proportion
## <chr> <int> <dbl>
## 1 Elementary 2 0
## 2 High 7230 0.915
## 3 Not reported 35 0.004
## 4 Other 631 0.08
\(\rightarrow\) What school levels are present in the data set and with what frequency? Present the results as a table and as a bar graph.
edgap %>%
count(school_level) %>%
mutate(school_level = fct_reorder(school_level, n)) %>%
ggplot(aes(x = school_level, y = n)) +
geom_col() +
coord_flip() +
labs(x = "School level", y = "Count") +
theme_bw()
\(\rightarrow\) Modify the edgap data frame to include only the schools that are high schools.
edgap <- edgap %>% filter(school_level == "High")
edgap %>%
count(school_level) %>%
mutate(proportion = round(n/sum(n),3))
## # A tibble: 1 × 3
## school_level n proportion
## <chr> <int> <dbl>
## 1 High 7230 1
We have two main goals when doing exploratory data analysis. The first is that we want to understand the data set more completely. The second goal is to explore relationships between the variables to help guide the modeling process to answer our specific question.
Make a pairs plot of the ACT scores and the socioeconomic variables. We will use the ggpairs function from the GGally package.
#It may take a few seconds to produce the plot
edgap %>%
select(-c(id,state,zip_code,school_type,school_level)) %>%
ggpairs(lower = list(continuous=wrap("smooth"))) +
theme_bw()
\(\rightarrow\) Describe the shape of the distribution of each variable.
All distributions appear to be unimodal.
rate_unemployment has a positively skewed distribution. The values near 0.6 may be outliers.
percent_college has a roughly symmetric distribution.
percent_married has a negatively skewed distribution.
average_act has an approximately normal distribution. This is not surprising for a distribution of averages.
median_income has a positively skewed distribution, which is what you expect for income distributions.
percent_lunch has a slightly positively skewed distribution.
What states are present in the data set and with what frequency?
\(\rightarrow\) First find the states present.
edgap %>%
select(state) %>%
unique()
## # A tibble: 20 × 1
## state
## <chr>
## 1 DE
## 2 FL
## 3 GA
## 4 IL
## 5 IN
## 6 KY
## 7 LA
## 8 MA
## 9 MI
## 10 MO
## 11 NJ
## 12 NY
## 13 NC
## 14 OH
## 15 PA
## 16 TN
## 17 TX
## 18 WA
## 19 WI
## 20 WY
\(\rightarrow\) Now count the observations by state. Display the results as a bar graph.
edgap %>%
count(state) %>%
ggplot(aes(x = state, y = n)) +
geom_col() +
coord_flip() +
labs(x = "State", y = "Count") +
theme_bw()
\(\rightarrow\) Reorder the states by count using fct_reorder
edgap %>%
count(state) %>%
mutate(state = fct_reorder(state, n)) %>%
ggplot(aes(x = state, y = n)) +
geom_col() +
coord_flip() +
labs(x = "State", y = "Count") +
theme_bw()
\(\rightarrow\) What school types are present in the data set and with what frequency? Present the results as a table and as a bar graph.
edgap %>%
count(school_type) %>%
mutate(proportion = round(n/sum(n),3))
## # A tibble: 4 × 3
## school_type n proportion
## <chr> <int> <dbl>
## 1 Alternative School 9 0.001
## 2 Career and Technical School 1 0
## 3 Regular School 7218 0.998
## 4 Special Education School 2 0
The vast majority (99.8%) are regular schools, but there are a few other types of schools.
We can also present the information as a graph:
edgap %>%
count(school_type) %>%
mutate(school_type = fct_reorder(school_type, n)) %>%
ggplot(aes(x = school_type, y = n)) +
geom_col() +
coord_flip() +
labs(x = "School type", y = "Count") +
theme_bw()
but the graph can be misleading because of the large number of regular schools and the few schools of the other types.
edgap %>%
filter(school_type != "Regular School") %>%
count(school_type) %>%
mutate(school_type = fct_reorder(school_type, n)) %>%
ggplot(aes(x = school_type, y = n)) +
geom_col() +
coord_flip() +
labs(x = "School type", y = "Count") +
theme_bw()
Our primary goal is to determine whether there is a relationship between the socioeconomic variables and average_act, so we will make plots that focus on those relationships.
The largest correlation coefficient between a socioeconomic variable and average_act is for percent_lunch, so we will start there.
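One way to compute such a ranking directly (a sketch on synthetic stand-in data; the same sapply call applies to the real edgap columns) is to correlate each candidate predictor with average_act, skipping incomplete pairs:

```r
set.seed(1)

# Synthetic stand-in for two predictors and the response
df <- data.frame(percent_lunch = runif(50),
                 median_income = rlnorm(50, meanlog = 10, sdlog = 0.5),
                 average_act   = rnorm(50, mean = 20, sd = 2))

# Correlation of each predictor with average_act, ignoring missing pairs
sapply(df[c("percent_lunch", "median_income")],
       function(v) cor(v, df$average_act, use = "pairwise.complete.obs"))
```

Using use = "pairwise.complete.obs" matches our decision to keep rows with partial NAs and omit the NAs only in the calculations that need to.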
\(\rightarrow\) Make a scatter plot of percent_lunch and average_act
ggplot(edgap, aes(x = percent_lunch, y = average_act)) +
geom_point() +
labs(x = 'Percent reduced or free lunch', y = 'Average school ACT or equivalent if SAT') +
theme_bw()
## Warning: Removed 23 rows containing missing values (geom_point).
There is a clear negative linear association between percent_lunch and average_act.
Why is there a wide range of ACT scores at percent_lunch = 0? It may be the case that some schools have no students receiving free or reduced lunch because the program does not exist, not because there are no families that would qualify under typical standards. This would require further research into the availability of the free or reduced price lunch program.
\(\rightarrow\) Make a scatter plot of median_income and average_act
ggplot(edgap, aes(x = median_income, y = average_act)) +
geom_point() +
labs(x = 'Median Income ($)', y = 'Average school ACT or equivalent if SAT') +
theme_bw()
## Warning: Removed 19 rows containing missing values (geom_point).
There is a positive association between median_income and average_act. The form of the association appears to be nonlinear.
\(\rightarrow\) median_income has a skewed distribution, so make a scatter plot with median_income plotted on a log scale.
ggplot(edgap, aes(x = median_income, y = average_act)) +
geom_point() +
scale_x_log10() +
labs(x = 'Median Income ($)', y = 'Average school ACT or equivalent if SAT') +
theme_bw()
## Warning: Removed 19 rows containing missing values (geom_point).
There is a positive linear association between the log of median_income and average_act. Note also that the typical regression assumption of equal variance of errors is closer to being satisfied after the log transformation.
\(\rightarrow\) Make a scatter plot of percent_college and average_act
ggplot(edgap, aes(x = percent_college, y = average_act)) +
geom_point() +
labs(x = 'Percent college', y = 'Average school ACT or equivalent if SAT') +
theme_bw()
## Warning: Removed 14 rows containing missing values (geom_point).
There is a positive association between percent_college and average_act. There may be a nonlinear component to the association, but it is difficult to assess visually from the scatter plot.
\(\rightarrow\) Make a scatter plot of percent_married and average_act
ggplot(edgap, aes(x = percent_married, y = average_act)) +
geom_point() +
labs(x = 'Percent married', y = 'Average school ACT or equivalent if SAT') +
theme_bw()
## Warning: Removed 23 rows containing missing values (geom_point).
There is a positive association between percent_married and average_act.
\(\rightarrow\) Make a scatter plot of rate_unemployment and average_act
rate_unemployment has a positively skewed distribution, so we should make a scatter plot on a transformed scale. There are values of rate_unemployment that are equal to zero, so we will use a square root transformation, rather than a log transformation.
min(edgap$rate_unemployment, na.rm = TRUE)
## [1] 0
library(latex2exp) #This library is used for LaTex in the axis labels
edgap %>%
ggplot(aes(x = sqrt(rate_unemployment), y = average_act)) +
geom_point() +
labs(x = TeX('$\\sqrt{$Rate of unemployment$}$'), y = 'Average school ACT or equivalent if SAT') +
theme_bw()
## Warning: Removed 15 rows containing missing values (geom_point).
\(\rightarrow\) Use the skim function to examine the basic numerical summaries for each variable.
edgap %>% select(-id) %>% skim()
| Name | Piped data |
| Number of rows | 7230 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state | 0 | 1 | 2 | 2 | 0 | 20 | 0 |
| zip_code | 0 | 1 | 5 | 5 | 0 | 6128 | 0 |
| school_type | 0 | 1 | 14 | 27 | 0 | 4 | 0 |
| school_level | 0 | 1 | 4 | 4 | 0 | 1 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| rate_unemployment | 12 | 1 | 0.10 | 0.06 | 0.00 | 0.06 | 0.08 | 0.12 | 0.59 | ▇▂▁▁▁ |
| percent_college | 11 | 1 | 0.57 | 0.17 | 0.09 | 0.45 | 0.56 | 0.68 | 1.00 | ▁▅▇▅▂ |
| percent_married | 20 | 1 | 0.64 | 0.19 | 0.00 | 0.53 | 0.67 | 0.78 | 1.00 | ▁▂▅▇▃ |
| median_income | 16 | 1 | 52760.47 | 24365.51 | 4833.00 | 37106.00 | 47404.00 | 62106.00 | 226181.00 | ▇▆▁▁▁ |
| average_act | 3 | 1 | 20.30 | 2.51 | 12.36 | 18.80 | 20.50 | 22.00 | 32.36 | ▁▇▇▁▁ |
| percent_lunch | 20 | 1 | 0.41 | 0.23 | 0.00 | 0.23 | 0.37 | 0.56 | 1.00 | ▅▇▆▃▂ |
Our exploratory data analysis has produced observations about the form and strength of each association that will help guide the modeling process.
In this project, we are very concerned with understanding the relationships in the data set; we are not only concerned with building a model with the highest prediction accuracy. Therefore, we will start by looking at simple linear regression models using each socioeconomic predictor.
Fit a model
We want to fit the simple linear regression model
average_act \(\approx \beta_0 + \beta_1\) percent_lunch
Use the function lm to fit a simple linear regression model.
fit_pl <- lm(average_act ~ percent_lunch, data = edgap)
Use the summary function to examine the coefficient estimates and their standard errors.
summary(fit_pl)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.744781 0.03693874 642.815 0
## percent_lunch -8.393695 0.07814631 -107.410 0
summary(fit_pl)
##
## Call:
## lm(formula = average_act ~ percent_lunch, data = edgap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7448 -0.9755 -0.0963 0.8736 11.4752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.74478 0.03694 642.8 <2e-16 ***
## percent_lunch -8.39370 0.07815 -107.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.555 on 7205 degrees of freedom
## (23 observations deleted due to missingness)
## Multiple R-squared: 0.6156, Adjusted R-squared: 0.6155
## F-statistic: 1.154e+04 on 1 and 7205 DF, p-value: < 2.2e-16
The intercept is 23.7 and the slope is -8.39. The intercept is interpreted as the best estimate for the mean ACT score when percent_lunch is zero. The slope can be interpreted as saying that a percent_lunch increase of 0.1 is associated with a decrease in the estimated mean ACT score of 0.839 points.
The slope is highly statistically significant, with a p-value below 2e-16. So, there is a statistically significant relationship between percent_lunch and the average ACT score in a school.
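Beyond the hypothesis test, a confidence interval conveys how precisely the slope is estimated; a short sketch using the built-in confint function:

```r
# 95% confidence intervals for the intercept and the percent_lunch slope
confint(fit_pl, level = 0.95)
```

Given the small standard error (0.078), the interval for the slope is roughly -8.39 ± 0.15 points.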
Plot the regression line together with the scatter plot
#geom_smooth(method = "lm",formula = y ~ x) adds the regression line to the scatter plot
ggplot(edgap, aes(x = percent_lunch, y = average_act)) +
geom_point() +
geom_smooth(method = "lm",formula = y ~ x) +
labs(x = "Percent free or reduced lunch", y = "School ACT average (or equivalent if SAT score)") +
theme_bw()
Assess the accuracy of the fit
Examine the \(R^2\) and the residual standard error
summary(fit_pl)$r.squared
## [1] 0.6155674
# R^2 is the proportion of the variance in average_act explained by the model
# The residual standard error (not squared) gives a typical error bound (i.e., predictions are 'off' by about +/- 1.6 points)
The simple linear regression model using percent free or reduced lunch as a predictor can explain 61.6% of the variance in the ACT score.
summary(fit_pl)$sigma
## [1] 1.55522
The residual standard error is 1.56 points. This means that the model is off by about 1.56 points, on average.
Together, these results show that we can make a good prediction of the average ACT score in a school only by knowing the percent of students receiving free or reduced lunch.
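As an illustration, we can sketch a prediction for a hypothetical school in which half of the students receive free or reduced lunch (the 0.5 value is our own choice, used only for illustration):

```r
# Predicted mean ACT score, with a 95% prediction interval,
# for a hypothetical school with percent_lunch = 0.5
predict(fit_pl,
        newdata = data.frame(percent_lunch = 0.5),
        interval = "prediction", level = 0.95)
```

The point estimate is about 23.74 - 8.39 × 0.5 ≈ 19.5, with a prediction interval of roughly ±2 × 1.56 points around it.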
Note that, in this example, we called different components from the summary output individually. We took this approach to walk through each step of the analysis, but you can simply look at the output of summary(fit_pl) to see all of the information at once.
Residual plot
Examine a residual plot to see if we can improve the model through a transformation of the percent_lunch variable.
#We will use ggplot2 to make the residual plot. You can also use plot(fit_pl)
ggplot(fit_pl,aes(x=percent_lunch, y=.resid)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x) +
labs(x = "Percent free or reduced lunch", y = "Residuals") +
theme_bw()
# If the residuals showed systematic structure (e.g., parabolic or cubic curvature), adding a polynomial term of the corresponding degree could improve the model
The residual plot does not have much systematic structure, so we have used percent_lunch as well as we can as a predictor. We do not need to consider a more complicated model that only uses percent_lunch as an input variable.
Fit a model
We want to fit the simple linear regression model
average_act \(\approx \beta_0 + \beta_1\) median_income
\(\rightarrow\) Use the function lm to fit a simple linear regression model.
fit_mi <- lm(average_act ~ median_income, data = edgap)
\(\rightarrow\) Use the summary function to examine the coefficient estimates and their standard errors.
summary(fit_mi)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.780119e+01 6.253060e-02 284.67963 0
## median_income 4.733825e-05 1.075819e-06 44.00205 0
The intercept is 17.8 and the slope is \(4.7\times 10^{-5}\).
The intercept is interpreted as the best estimate for the mean ACT score when median_income is zero. The median income is never equal to zero, so the intercept does not have a clear interpretation in this model.
The slope can be interpreted as saying that a difference of $1 in median_income is associated with a difference in the estimated mean ACT score of \(4.7 \times 10^{-5}\) points. This value of the slope is not intuitive because median income is in dollars, and differences in annual income of several dollars are not meaningful. It would be helpful to represent median income in tens of thousands of dollars to make a one-unit change more meaningful.
We will use the mutate function to add a new variable to the data frame called median_income_10k that represents median income in tens of thousands of dollars.
edgap <- edgap %>%
mutate(median_income_10k = median_income/1e4)
We can refit the model
fit_mi <- lm(average_act ~ median_income_10k, data = edgap)
summary(fit_mi)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.8011868 0.06253060 284.67963 0
## median_income_10k 0.4733825 0.01075819 44.00205 0
The intercept remains 17.8, but the slope is now 0.47, corresponding to the scaling of the median income units.
The slope can be interpreted as saying that a difference of $10,000 in median income is associated with a difference in the estimated mean ACT score of 0.47 points.
The slope is highly statistically significant. So, there is a statistically significant relationship between median income and the average ACT score in a school.
\(\rightarrow\) Plot the regression line together with the scatter plot
ggplot(edgap, aes(x = median_income_10k, y = average_act)) +
geom_point() +
geom_smooth(method = "lm",formula = y ~ x) +
labs(x = "Median Household Income (ten thousand $)", y = "School ACT average (or equivalent if SAT score)") +
theme_bw()
## Warning: Removed 19 rows containing non-finite values (stat_smooth).
## Warning: Removed 19 rows containing missing values (geom_point).
Assess the accuracy of the fit
\(\rightarrow\) Examine the \(R^2\) and the residual standard error
summary(fit_mi)$r.squared
## [1] 0.2117159
The simple linear regression model using median income as a predictor explains 21.2% of the variance in the ACT score.
summary(fit_mi)$sigma
## [1] 2.225708
The residual standard error is 2.23 points. This means that the model is off by about 2.23 points, on average.
Together, these results show that we can make a moderately accurate prediction of the average ACT score in a school by knowing only the median income.
Residual plot
\(\rightarrow\) Examine a residual plot to see if we can improve the model through a transformation of the median_income_10 variable.
ggplot(fit_mi,aes(x=median_income_10k, y=.resid)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x) +
labs(x = "Median Household Income (ten thousand $)", y = "Residuals") +
theme_bw()
The residual plot suggests that a nonlinear model may improve the fit.
We will try a log transformation.
Log transformation
Fit the model
\(\rightarrow\) Fit a linear regression model using the log of median_income_10 as the predictor.
fit_mi_log <- lm(average_act ~ log10(median_income_10k), data = edgap)
\(\rightarrow\) Use the summary function to examine the coefficient estimates and their standard errors.
summary(fit_mi_log)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.040603 0.09908011 161.89529 0
## log10(median_income_10k) 6.245003 0.14014512 44.56097 0
The intercept is 16.0. The intercept is interpreted as the prediction of the average ACT score when the predictor is zero, which means that the median income is $10,000.
The slope is 6.2. Because we used a base 10 log, the slope is interpreted as saying that a median income ten times greater than another is associated with an average ACT score that is 6.2 points higher.
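A doubling of income gives a more intuitive version of this interpretation: because the model is linear in the base 10 log of income, doubling median income changes the predictor by log10(2):

```r
# Estimated difference in mean ACT score associated with a doubling of median income
slope <- coef(fit_mi_log)["log10(median_income_10k)"]
unname(slope * log10(2))
```

This works out to about 6.2 × 0.301 ≈ 1.9 points.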
The slope is highly statistically significant.
\(\rightarrow\) Plot the regression line together with the scatter plot
ggplot(edgap, aes(x = log10(median_income_10k), y = average_act)) +
geom_point() +
geom_smooth(method = "lm",formula = y ~ x) +
labs(x = "Log of median Household Income", y = "School ACT average (or equivalent if SAT score)") +
theme_bw()
## Warning: Removed 19 rows containing non-finite values (stat_smooth).
## Warning: Removed 19 rows containing missing values (geom_point).
Assess the accuracy of the fit
\(\rightarrow\) Examine the \(R^2\) and the residual standard error
summary(fit_mi_log)$r.squared
## [1] 0.2159597
The simple linear regression model using the log of median income as a predictor can explain 21.6% of the variance in the ACT score.
summary(fit_mi_log)$sigma
## [1] 2.219709
The residual standard error is 2.22 points. This means that the model is off by about 2.22 points, on average.
Together, these results show that we can make a moderately accurate prediction of the average ACT score in a school by knowing only the median income.
Residual plot
\(\rightarrow\) Examine a residual plot to see if we can improve the model through a transformation of the input.
ggplot(fit_mi_log, aes(x=.fitted, y=.resid)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x) +
labs(x = "Fitted values", y = "Residuals") +
theme_bw()
The structure in the residual plot appears to be driven by the few points with fitted values near 14. Therefore, we will not consider further transformations.
Fit a model
We want to fit the simple linear regression model
average_act \(\approx \beta_0 + \beta_1\) percent_college
\(\rightarrow\) Use the function lm to fit a simple linear regression model.
fit_pc <- lm(average_act ~ percent_college, data = edgap)
\(\rightarrow\) Use the summary function to examine the coefficient estimates and their standard errors.
summary(fit_pc)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.307759 0.09476034 172.09477 0
## percent_college 6.966002 0.15889860 43.83929 0
The slope is highly statistically significant. So, there is a statistically significant relationship between percent college and the average ACT score in a school.
\(\rightarrow\) Plot the regression line together with the scatter plot
ggplot(edgap, aes(x = percent_college, y = average_act)) +
geom_point() +
geom_smooth(method = "lm",formula = y ~ x) +
labs(x = "Percent college", y = "School ACT average (or equivalent if SAT score)") +
theme_bw()
## Warning: Removed 14 rows containing non-finite values (stat_smooth).
## Warning: Removed 14 rows containing missing values (geom_point).
Assess the accuracy of the fit
\(\rightarrow\) Examine the \(R^2\) and the residual standard error
summary(fit_pc)$r.squared
## [1] 0.2103665
The simple linear regression model using percent_college as a predictor can explain 21.0% of the variance in the ACT score.
summary(fit_pc)$sigma
## [1] 2.227256
The residual standard error is 2.23 points. This means that the model is off by about 2.23 points, on average.
Together, these results show that we can make a moderately accurate prediction of the average ACT score in a school by knowing only the percent of college graduates among adults.
Residual plot
\(\rightarrow\) Examine a residual plot to see if we can improve the model through a transformation of the input variable.
ggplot(fit_pc, aes(x=percent_college, y=.resid)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x) +
labs(x = "Percent college", y = "Residuals") +
theme_bw()
The residual plot gives a slight suggestion that a quadratic model might be useful.
\(\rightarrow\) Fit and assess a quadratic model
fit_pc_quad <- lm(average_act ~ poly(percent_college,2, raw = T), data = edgap)
summary(fit_pc_quad)
##
## Call:
## lm(formula = average_act ~ poly(percent_college, 2, raw = T),
## data = edgap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3502 -1.2283 0.2854 1.4668 11.1775
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.1091 0.2686 59.980 <2e-16 ***
## poly(percent_college, 2, raw = T)1 7.6930 0.9333 8.242 <2e-16 ***
## poly(percent_college, 2, raw = T)2 -0.6129 0.7754 -0.790 0.429
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.227 on 7213 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.2104, Adjusted R-squared: 0.2102
## F-statistic: 961.2 on 2 and 7213 DF, p-value: < 2.2e-16
The coefficient on the squared term is not significant, so the quadratic model does not provide an improvement over the linear model.
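The same conclusion follows from a formal F-test comparing the nested models, which is valid here because both fits drop the same 14 incomplete rows:

```r
# F-test comparing the linear model to the quadratic model for percent_college
anova(fit_pc, fit_pc_quad)
```

For a single added term, the F-test p-value matches the t-test p-value on the squared term (about 0.43).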
Fit a model
We want to fit the simple linear regression model
average_act \(\approx \beta_0 + \beta_1\) rate_unemployment
\(\rightarrow\) Use the function lm to fit a simple linear regression model.
fit_ru <- lm(average_act ~ rate_unemployment, data = edgap)
\(\rightarrow\) Use the summary function to examine the coefficient estimates and their standard errors.
summary(fit_ru)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.15006 0.05252349 421.71719 0
## rate_unemployment -19.19084 0.46971479 -40.85636 0
The slope is highly statistically significant. So, there is a statistically significant relationship between unemployment rate and the average ACT score in a school.
\(\rightarrow\) Plot the regression line together with the scatter plot
ggplot(edgap, aes(x = rate_unemployment, y = average_act)) +
geom_point() +
geom_smooth(method = "lm",formula = y ~ x) +
labs(x = "Unemployment rate", y = "School ACT average (or equivalent if SAT score)") +
theme_bw()
## Warning: Removed 15 rows containing non-finite values (stat_smooth).
## Warning: Removed 15 rows containing missing values (geom_point).
Assess the accuracy of the fit
\(\rightarrow\) Examine the \(R^2\) and the residual standard error
summary(fit_ru)$r.squared
## [1] 0.1879303
The simple linear regression model using unemployment rate as a predictor can explain 18.8% of the variance in the ACT score.
summary(fit_ru)$sigma
## [1] 2.258699
The residual standard error is 2.26 points. This means that the model is off by about 2.26 points, on average.
Together, these results show that we can make a moderately accurate prediction of the average ACT score in a school by knowing only the unemployment rate.
Residual plot
\(\rightarrow\) Examine a residual plot to see if we can improve the model.
ggplot(fit_ru,aes(x=rate_unemployment, y=.resid)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x) +
labs(x = "Unemployment rate", y = "Residuals") +
theme_bw()
Try a square root transformation of rate_unemployment
fit_ru_sqrt <- lm(average_act ~ sqrt(rate_unemployment), data = edgap)
summary(fit_ru_sqrt)
##
## Call:
## lm(formula = average_act ~ sqrt(rate_unemployment), data = edgap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9088 -1.3999 0.0888 1.4304 12.1059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.09827 0.09705 248.3 <2e-16 ***
## sqrt(rate_unemployment) -12.72070 0.31252 -40.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.26 on 7213 degrees of freedom
## (15 observations deleted due to missingness)
## Multiple R-squared: 0.1868, Adjusted R-squared: 0.1867
## F-statistic: 1657 on 1 and 7213 DF, p-value: < 2.2e-16
The square root transformation yields essentially the same numerical accuracy as the linear model (R-squared of 0.187 vs. 0.188), so it does not provide a meaningful improvement.
ggplot(edgap, aes(x = rate_unemployment, y = average_act)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x) +
geom_smooth(method = "lm", formula = y ~ sqrt(x),color = 'red') +
labs(x = "Unemployment rate", y = "School ACT average (or equivalent if SAT score)") +
theme_bw()
## Warning: Removed 15 rows containing non-finite values (stat_smooth).
## Removed 15 rows containing non-finite values (stat_smooth).
## Warning: Removed 15 rows containing missing values (geom_point).
The models are very similar for most of the data, but differ a lot for high unemployment rates. There are few data points with high unemployment rates, so the numerical measures of accuracy are very similar.
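Because the two models have the same number of parameters and are fit to the same rows, we can also compare them directly with an information criterion:

```r
# Lower AIC indicates the better-fitting model
AIC(fit_ru, fit_ru_sqrt)
```

Consistent with the R-squared values above, the two AIC values should be very close.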
ggplot(fit_ru_sqrt,aes(x=.fitted, y=.resid)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x) +
labs(x = "Fitted values", y = "Residuals") +
theme_bw()
Fit a model
We want to fit the simple linear regression model
average_act \(\approx \beta_0 + \beta_1\) percent_married
\(\rightarrow\) Use the function lm to fit a simple linear regression model.
fit_pm <- lm(average_act ~ percent_married, data = edgap)
\(\rightarrow\) Use the summary function to examine the coefficient estimates and their standard errors.
summary(fit_pm)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.599927 0.09268168 179.10690 0
## percent_married 5.774247 0.13862735 41.65301 0
The slope is highly statistically significant. So, there is a statistically significant relationship between percent married and the average ACT score in a school.
\(\rightarrow\) Plot the regression line together with the scatter plot
ggplot(edgap, aes(x = percent_married, y = average_act)) +
geom_point() +
geom_smooth(method = "lm",formula = y ~ x) +
labs(x = "Percent married", y = "School ACT average (or equivalent if SAT score)") +
theme_bw()
## Warning: Removed 23 rows containing non-finite values (stat_smooth).
## Warning: Removed 23 rows containing missing values (geom_point).
Assess the accuracy of the fit
\(\rightarrow\) Examine the \(R^2\) and the residual standard error
summary(fit_pm)$r.squared
## [1] 0.1940692
The simple linear regression model using percent married as a predictor can explain 19.4% of the variance in the ACT score.
summary(fit_pm)$sigma
## [1] 2.250251
The residual standard error is 2.25 points. This means that the model is off by about 2.25 points, on average.
Together, these results show that we can make a moderately accurate prediction of the average ACT score in a school by knowing only the percent of married individuals.
Residual plot
\(\rightarrow\) Examine a residual plot to see if we can improve the model.
ggplot(fit_pm,aes(x=percent_married, y=.resid)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x) +
labs(x = "Percent married", y = "Residuals") +
theme_bw()
The residual plot suggests a quadratic model may be useful.
fit_pm_quad <- lm(average_act ~ poly(percent_married, 2, raw = T), data = edgap)
summary(fit_pm_quad)
##
## Call:
## lm(formula = average_act ~ poly(percent_married, 2, raw = T),
## data = edgap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1179 -1.4156 0.0554 1.3895 10.9173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.0161 0.1808 94.093 < 2e-16 ***
## poly(percent_married, 2, raw = T)1 4.0920 0.6430 6.364 2.08e-10 ***
## poly(percent_married, 2, raw = T)2 1.4801 0.5524 2.679 0.00739 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.249 on 7204 degrees of freedom
## (23 observations deleted due to missingness)
## Multiple R-squared: 0.1949, Adjusted R-squared: 0.1946
## F-statistic: 871.8 on 2 and 7204 DF, p-value: < 2.2e-16
The quadratic term is statistically significant, even though the improvement in accuracy is not large.
ggplot(edgap, aes(x = percent_married, y = average_act)) +
geom_point() +
geom_smooth(method = "lm",formula = y ~ x) +
geom_smooth(method = "lm",formula = y ~ poly(x,2), color = "red") +
labs(x = "Percent married", y = "School ACT average (or equivalent if SAT score)") +
theme_bw()
## Warning: Removed 23 rows containing non-finite values (stat_smooth).
## Removed 23 rows containing non-finite values (stat_smooth).
## Warning: Removed 23 rows containing missing values (geom_point).
The graph shows that the models are nearly identical.
Now that we understand how each variable is individually related to the ACT score, we want to know how to best use all of the socioeconomic variables to predict the ACT score.
We do not have many input variables, so we can examine the full model.
fit_full <- lm(average_act ~ percent_lunch + median_income + rate_unemployment + percent_college + percent_married, data = edgap)
Examine the summary of the fit
summary(fit_full)
##
## Call:
## lm(formula = average_act ~ percent_lunch + median_income + rate_unemployment +
## percent_college + percent_married, data = edgap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5334 -0.9347 -0.0858 0.8578 10.7692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.269e+01 1.371e-01 165.492 < 2e-16 ***
## percent_lunch -7.604e+00 9.650e-02 -78.801 < 2e-16 ***
## median_income -1.081e-07 1.205e-06 -0.090 0.929
## rate_unemployment -2.324e+00 4.065e-01 -5.717 1.13e-08 ***
## percent_college 1.754e+00 1.570e-01 11.171 < 2e-16 ***
## percent_married -7.501e-02 1.332e-01 -0.563 0.573
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.522 on 7181 degrees of freedom
## (43 observations deleted due to missingness)
## Multiple R-squared: 0.6316, Adjusted R-squared: 0.6314
## F-statistic: 2463 on 5 and 7181 DF, p-value: < 2.2e-16
The coefficients for median_income and percent_married are not statistically significant. Additionally, the signs of the coefficients for median_income and percent_married do not make sense. These results support removing median_income and percent_married from the model.
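A likely reason that median_income loses significance in the full model is correlation among the predictors. A sketch of a multicollinearity check, assuming the car package is installed:

```r
library(car)  # assumed available; provides vif()
# Variance inflation factors for the full model;
# values much larger than 1 indicate correlated predictors
vif(fit_full)
```

High VIFs for median_income and its correlates would explain why a predictor that is strong on its own adds little to the joint model.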
Use the regsubsets function from the leaps package to perform best subset selection in order to choose the best model to predict average_act from the socioeconomic predictors.
#perform best subset selection
regfit_full <- regsubsets(average_act ~ percent_lunch + median_income + rate_unemployment + percent_college + percent_married, data = edgap)
Get the summary of the best subset selection analysis
reg_summary <- summary(regfit_full)
What is the best model obtained according to Cp, BIC, and adjusted \(R^2\)? Make a plot of Cp, BIC, and adjusted \(R^2\) vs. the number of variables in the model.
#Set up a three panel plot
par(mfrow = c(1,3))
#Plot Cp
plot(reg_summary$cp,type = "b",xlab = "Number of variables",ylab = "Cp")
#Identify the minimum Cp
ind_cp = which.min(reg_summary$cp)
points(ind_cp, reg_summary$cp[ind_cp],col = "red",pch = 20)
#Plot BIC
plot(reg_summary$bic,type = "b",xlab = "Number of variables",ylab = "BIC")
#Identify the minimum BIC
ind_bic = which.min(reg_summary$bic)
points(ind_bic, reg_summary$bic[ind_bic],col = "red",pch = 20)
#Plot adjusted R^2
plot(reg_summary$adjr2,type = "b",xlab = "Number of variables",ylab = TeX('Adjusted $R^2$'),ylim = c(0,1))
#Identify the maximum adjusted R^2
ind_adjr2 = which.max(reg_summary$adjr2)
points(ind_adjr2, reg_summary$adjr2[ind_adjr2],col = "red",pch = 20)
The three measures agree that the best model has three variables.
Show the best model for each possible number of variables. Focus on the three variable model.
reg_summary$outmat
## percent_lunch median_income rate_unemployment percent_college
## 1 ( 1 ) "*" " " " " " "
## 2 ( 1 ) "*" " " " " "*"
## 3 ( 1 ) "*" " " "*" "*"
## 4 ( 1 ) "*" " " "*" "*"
## 5 ( 1 ) "*" "*" "*" "*"
## percent_married
## 1 ( 1 ) " "
## 2 ( 1 ) " "
## 3 ( 1 ) " "
## 4 ( 1 ) "*"
## 5 ( 1 ) "*"
The best model uses the predictors percent_lunch, rate_unemployment and percent_college.
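The coefficients of the selected model can also be extracted directly from the regsubsets object by specifying the model size:

```r
# Coefficients of the best three-variable model from best subset selection
coef(regfit_full, 3)
```

These should match the coefficients from refitting the three-variable model with lm, up to rounding.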
Fit the best model and examine the results
fit_best <- lm(average_act ~ percent_lunch + rate_unemployment + percent_college, data = edgap)
summary(fit_best)
##
## Call:
## lm(formula = average_act ~ percent_lunch + rate_unemployment +
## percent_college, data = edgap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4923 -0.9363 -0.0886 0.8611 10.7536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.62497 0.10206 221.688 < 2e-16 ***
## percent_lunch -7.59143 0.09212 -82.405 < 2e-16 ***
## rate_unemployment -2.13118 0.37252 -5.721 1.1e-08 ***
## percent_college 1.73653 0.12637 13.741 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.522 on 7191 degrees of freedom
## (35 observations deleted due to missingness)
## Multiple R-squared: 0.6311, Adjusted R-squared: 0.631
## F-statistic: 4101 on 3 and 7191 DF, p-value: < 2.2e-16
To compare the magnitudes of the coefficients, we should first normalize the predictors. Each of the predictors percent_lunch, rate_unemployment and percent_college is limited to the interval [0, 1], but they occupy different parts of the interval. We can normalize each variable through a z-score transformation:
scale_z <- function(x, na.rm = TRUE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
edgap_z <- edgap %>%
mutate_at(c("percent_lunch","rate_unemployment","percent_college"),scale_z)
\(\rightarrow\) Fit the model using the transformed variables and examine the coefficients
fit_z <- lm(average_act ~ percent_lunch + rate_unemployment + percent_college, data = edgap_z)
summary(fit_z)
##
## Call:
## lm(formula = average_act ~ percent_lunch + rate_unemployment +
## percent_college, data = edgap_z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4923 -0.9363 -0.0886 0.8611 10.7536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.29713 0.01795 1130.931 < 2e-16 ***
## percent_lunch -1.78055 0.02161 -82.405 < 2e-16 ***
## rate_unemployment -0.12065 0.02109 -5.721 1.1e-08 ***
## percent_college 0.28665 0.02086 13.741 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.522 on 7191 degrees of freedom
## (35 observations deleted due to missingness)
## Multiple R-squared: 0.6311, Adjusted R-squared: 0.631
## F-statistic: 4101 on 3 and 7191 DF, p-value: < 2.2e-16
The coefficient on percent_lunch is an order of magnitude larger than the other coefficients. This supports the conclusion that percent_lunch is the most important predictor.
Additionally, the performance of the single predictor model using percent_lunch is very close to the performance of the best model.
\(\rightarrow\) In addition to completing the above analyses, you should ask and answer one question about the data set.
We chose to compare Washington State and the State of New York. Both of these states have similar, sufficiently large numbers of entries, which reduces the chance of errors or biases arising from different sample sizes.
edgap_count <- edgap %>%
count(state) %>%
mutate(state = fct_reorder(state, n))
ggplot(edgap_count, aes(x = state, y = n)) +
geom_col() +
labs(x = "State", y = "Count")
We’ll create a new dataset that selects only the socioeconomic factors we want to look at only from observations in the two states we are comparing.
edgap_analysis <- edgap %>%
filter(state %in% c("WA", "NY")) %>%
mutate(state = as.factor(state)) %>%
select(rate_unemployment, percent_college, percent_married, median_income, average_act, percent_lunch, state)
head(edgap_analysis)
## # A tibble: 6 × 7
## rate_unemployment percent_college percent_married median_income average_act
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0926 0.821 0.728 71641 17.7
## 2 0.0614 0.853 0.891 78300 16.9
## 3 0.0484 0.877 1 82296 19.5
## 4 0.0512 0.962 0.905 145875 21.4
## 5 0.0419 0.922 0.902 108736 16.4
## 6 0.0994 0.901 0.718 105427 18.8
## # … with 2 more variables: percent_lunch <dbl>, state <fct>
We can visualize the distributions of the socioeconomic factors we are studying to help us identify these differences between the two states.
ggplot(edgap_analysis, aes(x=percent_lunch, color=state, fill=state)) + geom_density(alpha=0.1) + labs(title="Percent Free & Reduced Lunch Distribution", x="Percent Free & Reduced Lunch", y="Density") + theme_bw()
ggplot(edgap_analysis, aes(x=median_income, color=state, fill=state)) + geom_density(alpha=0.1) + labs(title="Median Income Distribution", x="Median Income", y="Density") + theme_bw()
## Warning: Removed 1 rows containing non-finite values (stat_density).
ggplot(edgap_analysis, aes(x=percent_college, color=state, fill=state)) + geom_density(alpha=0.1) + labs(title="Percent College Distribution", x="Percent College", y="Density") + theme_bw()
## Warning: Removed 1 rows containing non-finite values (stat_density).
ggplot(edgap_analysis, aes(x=percent_married, color=state, fill=state)) + geom_density(alpha=0.1) + labs(title="Percent Married Distribution", x="Percent Married", y="Density") + theme_bw()
## Warning: Removed 1 rows containing non-finite values (stat_density).
ggplot(edgap_analysis, aes(x=rate_unemployment, color=state, fill=state)) + geom_density(alpha=0.1) + labs(title="Rate Unemployment Distribution", x="Rate Unemployment", y="Density") + theme_bw()
## Warning: Removed 1 rows containing non-finite values (stat_density).
ggplot(edgap_analysis, aes(x=average_act, color=state, fill=state)) + geom_density(alpha=0.1) + labs(title="Average ACT Distribution", x="ACT Score (or equivalent if SAT)", y="Density") + theme_bw()
From these plots we can see that the distributions of most variables are similar in the two states, except for percent married, average ACT score, and percent free and reduced lunch. The percentages of students receiving free and reduced lunch in WA and NY are drastically different, despite relatively similar median income distributions. Why is that? New York has many high schools where a large share of students (around 75%) receive free or reduced lunch, while in Washington the typical share is lower (around 25-30%).
One possible explanation we researched is that the two states set different income thresholds to qualify a student for free or reduced lunch.
In Washington State, a family of 4 qualifies for free or reduced lunch with a maximum income of $51,338 (https://www.benefits.gov/benefit/1994). In the State of New York, a family of 4 qualifies for free lunch with a maximum income of $34,450, and for reduced lunch with a maximum income of $49,025 (https://otda.ny.gov/workingfamilies/schoollunch.asp). Therefore, there is a lower threshold to qualify for free and reduced lunch in New York compared to Washington. This policy difference may partly explain the distributions we saw above and makes the comparison across these two states an interesting one to examine in more detail. Hopefully, the different rates of percent lunch can help us isolate the variable and gain a deeper understanding of how it relates to the average ACT score. An additional benefit of comparing these two states is that neither required students to take the ACT or SAT at the time the data was collected (https://magoosh.com/hs/sat/states-that-require-the-act-or-sat/), a requirement which could skew average ACT scores.
Here we fit a linear regression for each state, comparing average ACT score against each socioeconomic factor. This helps us visualize the differences between the two states.
ggplot(edgap_analysis, aes(x = percent_lunch, y = average_act, color=state)) +
geom_point() +
labs(x = 'Percent reduced or free lunch', y = 'Average school ACT or equivalent if SAT') + geom_smooth(method="lm", formula = y~x, se=FALSE) +
theme_bw()
ggplot(edgap_analysis, aes(x = median_income, y = average_act, color=state)) +
geom_point() +
labs(x = 'Median Income ($)', y = 'Average school ACT or equivalent if SAT') + geom_smooth(method="lm", formula = y~x, se=FALSE) +
theme_bw()
ggplot(edgap_analysis, aes(x = median_income, y = average_act, color=state)) +
geom_point() +
scale_x_log10() +
labs(title="Log Scale", x = 'Median Income ($)', y = 'Average school ACT or equivalent if SAT') + geom_smooth(method="lm", formula = y~x, se=FALSE) +
theme_bw()
ggplot(edgap_analysis, aes(x = percent_college, y = average_act, color=state)) +
geom_point() +
labs(x = 'Percent college', y = 'Average school ACT or equivalent if SAT') + geom_smooth(method="lm", formula = y~x, se=FALSE) +
theme_bw()
ggplot(edgap_analysis, aes(x = percent_married, y = average_act, color=state)) +
geom_point() +
labs(x = 'Percent married', y = 'Average school ACT or equivalent if SAT') + geom_smooth(method="lm", formula = y~x, se=FALSE) +
theme_bw()
library(latex2exp) #This library is used for LaTex in the axis labels
edgap_analysis %>%
ggplot(aes(x = sqrt(rate_unemployment), y = average_act, color=state)) +
geom_point() +
labs(x = TeX('$\\sqrt{$Rate of unemployment$}$'), y = 'Average school ACT or equivalent if SAT') + geom_smooth(method="lm", formula = y~x, se=FALSE) +
theme_bw()
From these plots and the previous distributions, we can see that Washington state seems to have higher average ACT scores. We can also see that the impact of a couple of the socioeconomic factors differs across the two states, which appears as different slopes in their regression lines. In particular, the regression lines for percent free and reduced lunch in the two states are very different, while all others appear to have approximately similar slopes. The different slope might mean that the impact of free and reduced lunch on ACT scores in the two states differs. This may be related to the aforementioned different policies.
To more closely examine the relation between percent free and reduced lunch and average ACT scores in the two states, we will build a linear regression model that accounts for the interaction between the percent free and reduced lunch and the state variables. This will allow us to capture and compare the different trends in each state.
Here, we’ll determine which combination of variables produces a model with the closest fit to the data in the dataset.
library(leaps) # provides regsubsets() for best subset selection
regfit_full <- regsubsets(average_act ~ . + percent_lunch:state, data = edgap_analysis)
reg_summary <- summary(regfit_full)
reg_summary$outmat
## rate_unemployment percent_college percent_married median_income
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " "*" " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " "*" " "
## 5 ( 1 ) " " "*" "*" " "
## 6 ( 1 ) "*" "*" "*" " "
## 7 ( 1 ) "*" "*" "*" "*"
## percent_lunch stateWA percent_lunch:stateWA
## 1 ( 1 ) "*" " " " "
## 2 ( 1 ) "*" " " " "
## 3 ( 1 ) "*" "*" "*"
## 4 ( 1 ) "*" "*" "*"
## 5 ( 1 ) "*" "*" "*"
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
reg_summary$which
## (Intercept) rate_unemployment percent_college percent_married median_income
## 1 TRUE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE TRUE FALSE
## 3 TRUE FALSE FALSE FALSE FALSE
## 4 TRUE FALSE FALSE TRUE FALSE
## 5 TRUE FALSE TRUE TRUE FALSE
## 6 TRUE TRUE TRUE TRUE FALSE
## 7 TRUE TRUE TRUE TRUE TRUE
## percent_lunch stateWA percent_lunch:stateWA
## 1 TRUE FALSE FALSE
## 2 TRUE FALSE FALSE
## 3 TRUE TRUE TRUE
## 4 TRUE TRUE TRUE
## 5 TRUE TRUE TRUE
## 6 TRUE TRUE TRUE
## 7 TRUE TRUE TRUE
This tells us that some potentially important variables for predicting the average ACT score in our data are percent_lunch, stateWA, percent_lunch:stateWA, and percent_married.
To get a better understanding of which variables are the strongest predictors, we will look at several metrics of how well the model fits our data. Here we chose \(C_p\), BIC, and the adjusted \(R^2\). Lower \(C_p\) and BIC scores correspond to a better fit, while a higher adjusted \(R^2\) corresponds to a better fit. The following code plots each metric against the number of predictors included in the model.
par(mfrow = c(1,3))
plot(reg_summary$cp,type = "b",xlab = "Number of variables",ylab = "Cp")
ind_cp = which.min(reg_summary$cp)
points(ind_cp, reg_summary$cp[ind_cp],col = "red",pch = 20)
plot(reg_summary$bic,type = "b",xlab = "Number of variables",ylab = "BIC")
ind_bic = which.min(reg_summary$bic)
points(ind_bic, reg_summary$bic[ind_bic],col = "red",pch = 20)
plot(reg_summary$adjr2,type = "b",xlab = "Number of variables",ylab = "Adjusted R2")
ind_adjr2 = which.max(reg_summary$adjr2)
points(ind_adjr2, reg_summary$adjr2[ind_adjr2],col = "red",pch = 20)
From these plots, a model with 3 or 4 predictors appears to minimize the \(C_p\) and BIC scores while maximizing the adjusted \(R^2\) value. Thus, a model with this many predictors best fits our data, so we should aim to include this number of variables. However, which variables we select also affects the fit of the model, so we should look more closely at the choice of variables.
We know from the previous section that we want to include 3-4 predictors to optimize the fit of our model, and we have some idea of which ones these might be. Now, we'll further examine which variables to include. To help determine this, we will look at the statistical significance of each variable in the data; in particular, we are interested in the \(p\)-values. A small \(p\)-value gives us confidence that a variable's coefficient is nonzero in our model. However, it does not quantify the size of the variable's impact (it only indicates that the variable has some impact on ACT scores).
fit <- lm(average_act ~ . + percent_lunch:state, data = edgap_analysis)
round(summary(fit)$coefficients,3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.542 1.039 25.536 0.000
## rate_unemployment 2.267 2.593 0.874 0.383
## percent_college 0.965 0.953 1.013 0.312
## percent_married 0.973 0.661 1.473 0.142
## median_income 0.000 0.000 0.144 0.886
## percent_lunch -13.835 0.876 -15.786 0.000
## stateWA -3.500 0.732 -4.782 0.000
## percent_lunch:stateWA 5.243 1.140 4.601 0.000
So, we can see that the intercept, percent_lunch, stateWA, and percent_lunch:stateWA have significantly smaller \(p\)-values than the other variables. These terms capture the impact of percent free and reduced lunch and distinguish its effect in Washington versus New York. In other words, the percent_lunch coefficient is the most likely to be nonzero, and its effect likely differs between the two states. Since these \(p\)-values are so small, we can be fairly confident that there exists a correlation between percent free and reduced lunch and average ACT score (as in the entire dataset), but that the correlation differs between Washington and New York. This confirms what we saw in our exploratory plots.
Now, we will look at the precise values of the coefficients in our model. Here, we explicitly fit a model using the intercept, percent_lunch, stateWA, and percent_lunch:stateWA to predict ACT scores, due to their high statistical significance. This also allows us to more accurately isolate the impact of percent free and reduced lunch in the two states. We'll round to three decimal places.
fit <- lm(average_act ~ percent_lunch*state, data = edgap_analysis)
round(summary(fit)$coefficients,3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.553 0.628 45.489 0
## percent_lunch -14.561 0.815 -17.865 0
## stateWA -3.804 0.707 -5.382 0
## percent_lunch:stateWA 5.485 1.107 4.956 0
More explicitly,
round(coef(fit),3)
## (Intercept) percent_lunch stateWA
## 28.553 -14.561 -3.804
## percent_lunch:stateWA
## 5.485
From this we can find the precise regression models in each state.
New York State: \[(\text{Average ACT Score in NY}) = 28.553 - 14.561\times(\text{% free & reduced lunch in NY})\] This means that the New York 'baseline average score' for the ACT is about 28.5, with standard error of about 0.6. However, the average ACT score drops by about 1.5 (standard error about 0.08) for every 10% increase in percent of free and reduced lunch.
Washington State: \[(\text{Average ACT Score in WA}) = 24.749 - 9.076\times(\text{% free & reduced lunch in WA})\] This means that the Washington 'baseline average score' for the ACT is about 24.7. However, the average ACT score drops by about 0.9 for every 10% increase in percent of free and reduced lunch.
Thus, the impact of percent free and reduced lunch in Washington is less than it is in New York, by approximately 0.6 on the average ACT score for every 10% increase in percent of free and reduced lunch.
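The state-specific lines above follow directly from the interaction-model coefficients: New York is the baseline level of the state factor, and Washington adds the stateWA and interaction terms. A minimal base-R sketch using the rounded coefficients reported above (so the fitted lm object itself is not required):

```r
# Rounded coefficients from the interaction model above
b <- c(intercept = 28.553, percent_lunch = -14.561,
       stateWA = -3.804, lunch_stateWA = 5.485)

# New York is the reference level: its line uses the base terms only
ny_line <- c(intercept = b[["intercept"]], slope = b[["percent_lunch"]])

# Washington shifts the intercept by stateWA and the slope by the interaction
wa_line <- c(intercept = b[["intercept"]] + b[["stateWA"]],
             slope = b[["percent_lunch"]] + b[["lunch_stateWA"]])

ny_line  # intercept 28.553, slope -14.561
wa_line  # intercept 24.749, slope  -9.076
```
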
Now that we’ve conducted this analysis, we can return to the original questions we defined at the outset. We should now be able to better understand which of the five factors most correlates with the average ACT score, quantify the strength of these relationships, and discuss their behavior. This will allow us to determine which variable seems most important in predicting a school's average ACT score, which helps us begin asking how we can combat education inequality. It’s important to note, before looking more closely at any of the results, that these results show only correlation, not causation, so improper manipulation of any of these variables, such as ending free & reduced lunch programs, will not improve ACT scores. Rather, we should use these metrics as inspiration pointing toward further research into the root causes of education inequality.
Using linear regression models, we were able to show that the correlation between each of the socioeconomic variables and the average ACT score was strong, as indicated by the \(p\)-values. When we fit five models, each with one socioeconomic factor as the lone predictor, we found that they all correlated strongly: each model showed a \(p\)-value very near zero for the coefficient relating the factor to a school's average ACT score. That is, every one of the five socioeconomic factors has a statistically significant correlation with the average ACT score. However, we also computed other metrics to analyze the fit of these five models, namely the \(R^2\) value and residual standard error (RSE). These tell us, respectively, the percentage of the variance in the data that the model explains and the average error of its predictions.
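The single-predictor comparison above can be sketched as follows. This sketch uses simulated data in place of edgap_analysis (the column names mirror the real dataset, but the numbers are illustrative only, with percent_lunch as the sole true driver of scores):

```r
# Simulated stand-in for edgap_analysis (illustrative values only)
set.seed(1)
n <- 500
d <- data.frame(percent_lunch   = runif(n),
                median_income   = rlnorm(n, meanlog = 10.5, sdlog = 0.4),
                percent_college = runif(n, 0.05, 0.60))
d$average_act <- 28 - 14 * d$percent_lunch + rnorm(n, sd = 1.5)

# Fit each factor as the sole predictor and collect R^2 and RSE
predictors <- setdiff(names(d), "average_act")
fits <- lapply(predictors, function(v) lm(reformulate(v, "average_act"), data = d))
data.frame(predictor = predictors,
           r_squared = sapply(fits, function(f) summary(f)$r.squared),
           rse       = sapply(fits, function(f) summary(f)$sigma))
```

With the real data, the same loop over the five factors produces the \(R^2\) and RSE figures discussed below.
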
According to these measures, we found that the median household income of the geographic region near the school was a moderate predictor of the school’s average ACT score. The two variables had a positive correlation and, after a logarithmic transform of the data, the model explained 21.6% of the variance in ACT scores and was, on average, 2.22 points off in its predictions. Also positively correlated, the percentage of adults with a college degree explained 21.0% of the variance and produced predictions that were off by 2.23 points on average; a higher-order polynomial regression was considered here, but the quadratic term proved statistically insignificant and was excluded. The unemployment rate, in contrast, correlated negatively with the average ACT score, but explained only 18.8% of the variance and predicted scores that were on average 2.26 points off from the actual values. A square root transformation provided only a small improvement over the linear model here; because the unemployment rate is skewed, the relative infrequency of high-unemployment data points muted the transformation's effect on the \(R^2\) and RSE values. The percentage of children in married-couple families also correlated positively with ACT scores and explained 19.4% of their variance, with predictions off by 2.25 points on average; although this model had a statistically significant quadratic term, that term failed to meaningfully improve accuracy. The best predictor by far was the percentage of students on free and reduced lunch at a school. This variable correlated negatively with the school's average ACT score and explained 61.6% of its variance.
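The quadratic-term checks mentioned above follow a standard pattern in R: add the squared term with I() and inspect its \(p\)-value before keeping it. A minimal sketch with simulated (truly linear) data, not the project's actual fits:

```r
# Simulated data with a purely linear relationship
set.seed(3)
x <- runif(300, 0.1, 0.9)
y <- 20 + 5 * x + rnorm(300, sd = 2)

# I(x^2) adds the quadratic term; its row in the coefficient table
# carries the p-value used to decide whether the term stays
fit_quad <- lm(y ~ x + I(x^2))
summary(fit_quad)$coefficients["I(x^2)", ]
```
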
The residual standard error of this model is 1.56 points, meaning the model’s predicted ACT score is on average off by only about 1.56 points from the school’s actual average ACT score. While this may not seem like a large gain, being consistently 0.7 points more accurate on average across the entire dataset represents a marked improvement in accuracy.
However, this analysis considered each variable as the sole predictor. Once we considered the full set of variables as potential predictors at once, the percentage of students receiving free and reduced lunch was again easily the most significant predictor. To determine this, we conducted best subset selection for a linear regression model over all five socioeconomic factors. According to BIC, \(C_p\), and adjusted \(R^2\), the best model uses 3 predictors: the percentage of the student population qualifying for free and reduced lunch, the unemployment rate, and the percentage of adults with a college degree. Analysis of the \(p\)-values of the coefficients for these variables showed that all were statistically significant, that is, very small. Thus, we could be fairly confident that each of these variables had a nonzero impact on our prediction of a school's average ACT score. To compare these impacts and determine the relative importance of each predictor, we normalized the predictors so that they were all on similar scales. This lets us directly compare the magnitudes of the coefficients, which tell us which variable produces the greatest change in the predicted average ACT score for a comparable change in its value. Doing this, we found that the coefficient on the percentage of students receiving free & reduced lunch is an order of magnitude larger than the other coefficients. This supports the conclusion that the percentage of students receiving free & reduced lunch is the most important predictor of a school's average ACT score, and by some margin.
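The normalization step described above can be sketched like this, again on simulated data standing in for edgap_analysis. The scale() call z-scores each predictor, so each coefficient measures the change in ACT points per one standard deviation of its predictor, making the magnitudes directly comparable:

```r
# Simulated stand-in data (illustrative values only)
set.seed(2)
n <- 400
d <- data.frame(percent_lunch     = runif(n),
                rate_unemployment = runif(n, 0, 0.2),
                percent_college   = runif(n, 0.05, 0.6))
d$average_act <- 28 - 14 * d$percent_lunch - 5 * d$rate_unemployment +
  2 * d$percent_college + rnorm(n, sd = 1.5)

# Standardize the predictors (mean 0, sd 1), leaving the response in ACT points
d_std <- d
d_std[, 1:3] <- scale(d_std[, 1:3])
fit_std <- lm(average_act ~ ., data = d_std)

# The largest |coefficient| marks the most influential predictor per SD
round(coef(fit_std), 2)
```
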
This conclusion is further strengthened by the performance of this model: the three-variable model had an RSE of 1.522 and an \(R^2\) value of 0.631, performing nearly identically to the model with the percentage of students receiving free & reduced lunch as the sole predictor.
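This kind of one-predictor versus three-predictor comparison can be done directly in base R; a sketch with simulated data standing in for edgap_analysis (here only percent_lunch truly matters, so the extra predictors add almost nothing):

```r
# Simulated stand-in data (illustrative values only)
set.seed(4)
n <- 400
d <- data.frame(percent_lunch     = runif(n),
                rate_unemployment = runif(n, 0, 0.2),
                percent_college   = runif(n, 0.05, 0.6))
d$average_act <- 28 - 14 * d$percent_lunch + rnorm(n, sd = 1.5)

m1 <- lm(average_act ~ percent_lunch, data = d)
m3 <- lm(average_act ~ percent_lunch + rate_unemployment + percent_college,
         data = d)

# RSE and R^2 for each model; near-identical values mean little was gained
c(rse_1 = summary(m1)$sigma,     rse_3 = summary(m3)$sigma)
c(rsq_1 = summary(m1)$r.squared, rsq_3 = summary(m3)$r.squared)

# Formal check: F-test for whether the extra predictors improve the fit
anova(m1, m3)
```
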
# patchwork lets us combine multiple ggplot objects into one figure
library(patchwork)
# fit_pl holds the percent_lunch-only model data, augmented with residuals (.resid)
p0 <- ggplot(fit_pl, aes(x=percent_lunch, y=average_act)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x) +
labs(title = "Percent Lunch as Sole Predictor", x = "Percent Free & Reduced Lunch", y = "Predicted Average ACT") +
theme_bw()
p1 <- ggplot(fit_pl,aes(x=percent_lunch, y=.resid)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x) +
labs(x = "Percent Free & Reduced Lunch", y = "Residuals") +
theme_bw()
p0 + p1
Therefore, it appears that out of the five socioeconomic factors we chose to examine, the percentage of students receiving free & reduced lunch at a school is easily the most important predictor of the school's average ACT score. Again, it is important to highlight that this does not mean a low percentage of students receiving free & reduced lunch causes higher ACT scores, only that the two correlate; this analysis cannot identify the true cause of the correlation. Along with this point, it is important to consider other limitations of our results. While there are many, the main ones fall into two categories: limitations in the data we used and limitations in our analysis.
We can be confident in the quality of the data we did use, since we sourced it from the Census and EdGap databases, both of which have reliable checks on the accuracy of the data. However, as in any model-building application, our results could have been improved by better data. One of the main shortcomings of our data is the fundamental disconnect between the Census data, which is collected on a geographic basis, and the EdGap data, which is collected at the school level; the two may not be reconciled perfectly. For instance, not every child attends the school nearest them geographically. Additionally, the Census data measures the population at large, not just the student population. So, median income on the Census, for instance, does not directly reflect the median income of families at a school, since the Census data includes individuals without children. Another large shortcoming is that our data includes entries from only 20 states, meaning the majority of states were overlooked. This could have a large impact, as our additional analysis shows how these correlations can vary greatly between two states. We could also have considered many more socioeconomic factors beyond the five that we decided to study in detail. Furthermore, data on private schools either does not exist or is prohibitively difficult to access and use, meaning our data completely excludes this potentially large and impactful subset of the student population. Finally, we are forced to compare academic performance using ACT scores and, at times, transformed SAT equivalents, since this is the most reliable data we have access to.
While these scores may not accurately reflect intelligence or academic performance, they do play a significant role in many college admissions processes and thus carry huge consequences for the educational opportunities available to students. However, the ACT and SAT are not perfectly compatible, so even after conversion the scores may not be directly comparable. Further variability we fail to consider includes variation in policy, such as the threshold for free & reduced lunch and whether a state requires students to take the ACT or SAT. The final limitation in our data we will mention here is that it is derived from only one academic year, which could mask certain properties of the data. All of these limitations, and many others there is not space to mention, limit the conclusions we can draw and could introduce biases into our results, since our analysis does not apply to underrepresented or missing areas.
The methods we used are well suited to the questions we set out to answer (mostly regarding the correlation between these factors and ACT scores), but other models and approaches could help answer other questions or predict the data more successfully. While the linear regression models we used are well adapted to understanding which factors are the 'most' important predictors and how they interact, a neural network or other model might have produced more accurate predictions, albeit perhaps without the same clear insight into the importance of each variable as a predictor. Also, since our analysis relies on the data we feed it, all the data limitations we discussed flow into our models and limit the conclusions we can draw from them.
Therefore, our work has shown that the percentage of students receiving free & reduced lunch has a strikingly high correlation with the average ACT score of a school in the US. While this points at a large inequality in public education, we cannot draw firm conclusions about the causes or draft policy based solely on our results. Instead, it is important to research the root cause of this inequality and determine the best way to solve the issue at its source. Even then, it is more than likely that the problem of education inequality will persist due to other factors that deserve equal scrutiny to this one.